
Automating Fault Tolerance in High-Performance Computational Biological Jobs Using Multi-Agent Approaches



Abstract

Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more computing cores on which they execute fail. This places not only a cost on the maintenance of the job, but also a cost on the time taken for reinstating the job and the risk of losing data and execution accomplished by the job before it failed. Approaches which can proactively detect computing core failures and take action to relocate the computing core's job onto reliable cores can make a significant step towards automating fault tolerance.

Method: This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. The approaches are investigated for single core failure scenarios that can occur in the execution of parallel reduction algorithms on computer clusters. A third approach is proposed that incorporates multi-agent technology both at the job and core level. Experiments are pursued in the context of genome searching, a popular computational biology application.

Result: The key conclusion is that the approaches proposed are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical experiment in which the fault tolerance is studied, centralised and decentralised checkpointing approaches on an average add 90% to the actual time for executing the job. On the other hand, in the same experiment the multi-agent approaches add only 10% to the overall execution time.
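The idea of relocating a sub-job away from a core that is predicted to fail, during a parallel reduction, can be illustrated with a toy sketch. This is not the paper's implementation: the `Core`, `Agent`, `reduce_with_agents`, and `overhead_time` names, the `healthy` flag standing in for a failure predictor, and the summation used as the reduction operator are all assumptions made for illustration. The sketch runs sequentially in one process and only models a single predicted core failure, with a pool of spare cores assumed to be available.

```python
from dataclasses import dataclass


@dataclass
class Core:
    core_id: int
    healthy: bool = True  # hypothetical flag that a failure predictor would set


@dataclass
class Agent:
    """Hypothetical agent that carries one sub-job (a slice of the input)."""
    data: list
    core: Core

    def migrate_if_needed(self, spares):
        # Proactively move to a spare core if the current core is flagged.
        # Assumes at least one spare exists; pop() raises IndexError otherwise.
        if not self.core.healthy:
            self.core = spares.pop()
            return True
        return False

    def run(self):
        # The sub-job here is a partial sum; any associative operator would do.
        return sum(self.data)


def reduce_with_agents(values, cores, spares, failing_ids=()):
    """Return (reduced value, number of migrations) for a toy tree reduction."""
    n = len(cores)
    chunks = [values[i::n] for i in range(n)]  # one chunk per core
    agents = [Agent(chunk, core) for chunk, core in zip(chunks, cores)]

    # Simulate the predictor flagging cores before they actually fail.
    for core in cores:
        if core.core_id in failing_ids:
            core.healthy = False

    migrations = sum(agent.migrate_if_needed(spares) for agent in agents)
    partials = [agent.run() for agent in agents]

    # Pairwise (tree) combination of the partial results.
    while len(partials) > 1:
        paired = [partials[i] + partials[i + 1]
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:
            paired.append(partials[-1])
        partials = paired
    return partials[0], migrations


def overhead_time(base_time, overhead_fraction):
    """Total time when fault tolerance adds a fraction of the base time."""
    return base_time * (1 + overhead_fraction)
```

With four cores, one spare, and core 2 flagged, `reduce_with_agents(list(range(10)), cores, spares, failing_ids={2})` still reduces to the full sum while reporting one migration. For the reported overheads, a job with a hypothetical 60-minute fault-free run time would take roughly `overhead_time(60, 0.9)` (about 114 minutes) with checkpointing and `overhead_time(60, 0.1)` (about 66 minutes) with the multi-agent approaches; the 60-minute baseline is an illustrative figure, not one from the paper.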
